Appendix E — Assignment E

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code separately, not as a group activity.

  2. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  3. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command `quarto render filename.ipynb --to html`. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Monday, 5th June 2023 at 11:59 pm.

  5. All the estimated code execution times in this assignment are based on an n1-standard-32 instance (a 32-core virtual machine) on Google Colab.

  6. Five points are allocated to properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

E.1 Conceptual

E.1.1 Ensembling

Is it possible for an ensemble model to perform worse than one or more of the individual models? Why or why not?

(1 + 4 points)

E.1.2 Ensemble fail

If an ensemble model does perform worse than one or more of the individual models, then what should be the course of action?

(3 points)

E.2 Regression Problem - Miami housing

E.2.1 Data preparation

Read the data miami-housing.csv. Check the description of the variables here. Split the data into 60% train and 40% test. Use random_state = 45. The response is SALE_PRC, and the rest of the columns are predictors, except PARCELNO. Print the shape of the predictors dataframe of the train data.
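The data-preparation step can be sketched as below. The mini DataFrame here is a hypothetical stand-in for miami-housing.csv (its column names, other than SALE_PRC and PARCELNO, are invented for illustration); in the assignment you would instead call `pd.read_csv("miami-housing.csv")`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for miami-housing.csv; replace with:
# data = pd.read_csv("miami-housing.csv")
rng = np.random.default_rng(45)
data = pd.DataFrame({
    "PARCELNO": np.arange(100),                       # parcel ID, not a predictor
    "TOT_LVG_AREA": rng.normal(2000, 400, 100),       # made-up predictor names
    "LND_SQFOOT": rng.normal(8000, 1500, 100),
    "SALE_PRC": rng.normal(400_000, 90_000, 100),     # response
})

y = data["SALE_PRC"]
X = data.drop(columns=["SALE_PRC", "PARCELNO"])       # drop response and parcel ID

# 60% train / 40% test split with the required random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)
print(X_train.shape)
```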

(1 point)

E.2.2 MARS model

Develop a MARS model to predict SALE_PRC based on all the predictors. Compute the MAE on test data.

Assume that you have used GridSearchCV to tune the max_degree hyperparameter of the model, and the optimal value comes out to be max_degree = 3. Use this value to train the model.

Estimated code execution time: 1 minute

The test MAE should be around $55,000.

(2 points)

E.2.3 Bagged MARS model

Bag 20 MARS models with the same value of max_degree, and report the test MAE of the bagged MARS model.

Estimated code execution time: 5 minutes

The test MAE should be around $51,000.

(4 points)
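The bagging pattern can be sketched with scikit-learn's BaggingRegressor. Since a MARS library may not be installed in every environment, the sketch below uses a DecisionTreeRegressor as a stand-in base estimator on synthetic data; for the assignment you would pass your MARS estimator (e.g. pyearth's `Earth(max_degree=3)`, if that is the MARS implementation used in class) instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the Miami housing predictors/response
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)

# Swap DecisionTreeRegressor() for your MARS estimator, e.g. Earth(max_degree=3)
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20, random_state=45)
bag.fit(X_train, y_train)

mae = mean_absolute_error(y_test, bag.predict(X_test))
print(round(mae, 2))
```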

E.2.4 Voting ensemble

Develop a voting ensemble model with:

  1. The bagged MARS model developed in E.2.3,

  2. The tuned bagged tree model developed in C.1.6.2

  3. The tuned random forest model developed in C.1.8.1

  4. The tuned AdaBoost model developed in D.2.2

  5. The tuned Gradient boosting model (with Huber loss) developed in D.2.6

  6. The tuned XGBoost model developed in D.2.10

Report the MAE of each of the above models (1-6), and the voting ensemble.

The MAE of the voting ensemble is likely to be higher than some of the individual models, as these models have a broad range of MAEs (see equation 10.1 in class notes).

**Note:**

1. If you had replaced the boosting models in (5) and (6) with other boosting models, you can use those.

2. You may either use the function VotingRegressor(), or simply take the average of the predictions of all the models and compute the MAE. The latter will be quicker: you have already fitted the individual models to compute their predictions and respective MAEs, so you don't need to fit them again with VotingRegressor().
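The averaging approach from the note can be sketched as follows, using three generic regressors on synthetic data as stand-ins for models 1-6:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=45)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.4, random_state=45)

# Stand-ins for the tuned individual models
models = [LinearRegression(),
          DecisionTreeRegressor(random_state=45),
          RandomForestRegressor(n_estimators=50, random_state=45)]

preds = []
for m in models:
    m.fit(Xtr, ytr)
    p = m.predict(Xte)
    preds.append(p)
    print(type(m).__name__, round(mean_absolute_error(yte, p), 2))

# Voting ensemble = simple average of the individual models' predictions
vote_pred = np.mean(preds, axis=0)
print("voting ensemble", round(mean_absolute_error(yte, vote_pred), 2))
```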

(6 + 4 points)

E.2.5 Voting ensemble with good models

Only ensemble those models that have comparable MAEs and relatively low MAEs. These are likely to be models (5) and (6) in the previous question (E.2.4). Report the MAE of this voting ensemble.

This ensemble is likely to have a lower MAE than each of the models 1-6 in the previous question (E.2.4).

(4 points)

E.2.6 Stacking ensemble with Linear regression

Develop a linear regression metamodel based on models 1-6 in E.2.4. Report the MAE of the metamodel on test data. Which model has the highest weight in the ensemble?

Note:

  1. You may use the StackingRegressor() function. However, as the next set of questions asks you to develop different metamodels based on the models 1-6 in E.2.4, using StackingRegressor() will be inefficient, as it will involve fitting each of the individual models every time it is called.

  2. A faster way is to use the cross_val_predict() function to compute the 5-fold cross-validated predictions from each of the models 1-6, treat these predictions from the 6 models as 6 different predictors, and fit the metamodel on them. Once computed, these cross-validated predictions can be reused with different metamodels without fitting the individual models repeatedly with StackingRegressor().
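The cross_val_predict() approach from the note can be sketched as below, with two generic regressors on synthetic data standing in for models 1-6:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=45)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.4, random_state=45)

# Stand-ins for the tuned base models
base_models = [DecisionTreeRegressor(random_state=45),
               RandomForestRegressor(n_estimators=50, random_state=45)]

# 5-fold cross-validated predictions become the metamodel's training predictors
Z_train = np.column_stack(
    [cross_val_predict(m, Xtr, ytr, cv=5) for m in base_models]
)

# Fit each base model on the full training data to build the test-set predictors
Z_test = np.column_stack(
    [m.fit(Xtr, ytr).predict(Xte) for m in base_models]
)

meta = LinearRegression().fit(Z_train, ytr)
print("weights:", np.round(meta.coef_, 3))
print("test MAE:", round(mean_absolute_error(yte, meta.predict(Z_test)), 2))
```

Z_train and Z_test can now be reused directly with any other metamodel (lasso, MARS, random forest, XGBoost) without refitting the base models.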

(8 points)

E.2.7 Stacking ensemble with Lasso

Develop a lasso metamodel based on models 1-6 in E.2.4. Tune the hyperparameter C for the lasso metamodel. Report the MAE of the metamodel on test data.
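Tuning the lasso metamodel can be sketched as below. Note that scikit-learn's Lasso is parameterized by `alpha` (which plays the inverse role of C: larger alpha means more shrinkage), so tuning C amounts to tuning alpha. The matrix Z here is a hypothetical stand-in for the 6 base-model cross-validated predictions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(45)
Z = rng.normal(size=(200, 6))        # stand-in for the 6 base-model predictions
y = Z @ np.array([0.1, 0.2, 0.0, 0.4, 0.9, 0.8]) + rng.normal(scale=0.1, size=200)

# Scale the predictors, then tune alpha (inverse of C) by cross-validated MAE
grid = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(max_iter=10_000)),
    {"lasso__alpha": np.logspace(-4, 1, 20)},
    scoring="neg_mean_absolute_error", cv=5,
)
grid.fit(Z, y)
print(grid.best_params_)
```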

(6 points)

E.2.8 Stacking ensemble with MARS

Develop a MARS metamodel based on models 1-6 in E.2.4. Take max_degree = 1. In general, the optimal degree of a MARS metamodel will be 1. This is because the metamodel is based on very strong predictors, and thus increasing its complexity is likely to overfit. Of course, in rare cases the optimal degree may be greater than 1. Report the MAE of the metamodel on test data.

(4 points)

E.2.9 Stacking ensemble with Random Forest

Develop a Random forest metamodel based on models 1-6 in E.2.4. Tune the max_samples hyperparameter of the metamodel. Report the MAE of the metamodel on test data.
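Tuning max_samples for a random forest metamodel can be sketched as below, again with a synthetic Z standing in for the 6 base-model cross-validated predictions (None means each tree bootstraps the full training set).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(45)
Z = rng.normal(size=(200, 6))        # stand-in for the 6 base-model predictions
y = Z.sum(axis=1) + rng.normal(scale=0.2, size=200)

# Tune the fraction of training rows sampled for each tree
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=45),
    {"max_samples": [0.25, 0.5, 0.75, None]},
    scoring="neg_mean_absolute_error", cv=5,
)
grid.fit(Z, y)
print(grid.best_params_)
```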

(6 points)

E.2.10 Stacking ensemble with XGBoost

Develop an XGBoost metamodel based on models 1-6 in E.2.4. Tuning the metamodel is optional. Report the MAE of the metamodel on test data.

(5 points)

E.2.11 Ensemble of ensembles

Develop a voting ensemble of the previous 5 stacking ensembles (i.e., the stacking ensembles in E.2.6, E.2.7, E.2.8, E.2.9, and E.2.10). Report the MAE of the meta-metamodel on test data.

This should be your best model, with the lowest MAE, which should be less than $41,500.

(5 points)

E.3 Classification - Term deposit

The data for this question relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, in which bank clients were called and asked to subscribe to a term deposit.

There is a training dataset, train.csv, which you will use to develop a model, and a test dataset, test.csv, which you will use to test your model. Each dataset has the following attributes about the clients called in the marketing campaign:

  1. age: Age of the client

  2. education: Education level of the client

  3. day: Day of the month the call is made

  4. month: Month of the call

  5. y: Did the client subscribe to a term deposit?

  6. duration: Call duration, in seconds. This attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Raw data source: Source. Do not use the raw data source for this assignment. It is just for reference.)

E.3.1 Data preparation

Convert all the categorical predictors in the data to dummy variables. Note that month and education are categorical variables.
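The conversion can be sketched with pandas' get_dummies. The mini DataFrame below is a hypothetical stand-in for train.csv, with made-up values:

```python
import pandas as pd

# Hypothetical mini version of train.csv; replace with pd.read_csv("train.csv")
df = pd.DataFrame({
    "age": [35, 52, 28],
    "education": ["primary", "tertiary", "secondary"],
    "day": [5, 12, 23],
    "month": ["may", "jul", "nov"],
    "y": ["no", "yes", "no"],
})

# One-hot encode the categorical predictors; keep age and day as numeric
X = pd.get_dummies(df.drop(columns="y"), columns=["education", "month"])
y = (df["y"] == "yes").astype(int)   # binary response
print(X.columns.tolist())
```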

(1 point)

E.3.2 Voting ensemble - hard voting

Develop a voting ensemble (hard voting) based on the models in:

  1. Tuned Generalized additive model in B.4

  2. Tuned Random Forest model in C.2.3

  3. Tuned boosting model in D.3.2

Report the accuracy and recall on test data for each of the individual models (1-3), and the hard voting ensemble.

(7 points - 3 points for reporting the accuracy and recall for the individual models, 2 points for taking the majority vote of predicted class, 2 points for reporting accuracy and recall on test data)
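The majority-vote step can be sketched directly with NumPy. The 0/1 predictions below are hypothetical stand-ins for the three tuned classifiers' test-set predictions:

```python
import numpy as np

# Hypothetical 0/1 class predictions from the three models on five test points
preds = np.array([
    [1, 0, 1, 0, 1],   # stand-in for the GAM's predictions
    [1, 1, 0, 0, 1],   # stand-in for the random forest's predictions
    [0, 1, 1, 0, 1],   # stand-in for the boosting model's predictions
])

# Hard voting: predict the class chosen by at least 2 of the 3 models
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority.tolist())  # → [1, 1, 1, 0, 1]
```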

E.3.3 Voting ensemble - soft voting

Develop a soft voting ensemble based on the models in E.3.2. Tune the decision threshold probability of the soft-voting ensemble to achieve the highest possible accuracy for a minimum recall of 65% on test (unseen) data. However, the test data must remain untouched while tuning. Report the accuracy and recall of the soft-voting ensemble on test data.

Note:

1. Use the cross-validated predicted probabilities of models 1-3 in E.3.2 to find the average predicted probability.

2. Plot the cross-validated accuracy and recall against decision threshold probability. Tune the decision threshold probability based on the plot, or the data underlying the plot to achieve the required trade-off between recall and accuracy.
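The threshold-tuning loop in the note can be sketched as below. The probabilities here are simulated stand-ins for the cross-validated predicted probabilities of models 1-3:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(45)
y = rng.integers(0, 2, size=500)                     # stand-in training labels
# Simulated cross-validated P(y=1) from three models, correlated with y
probs = np.clip(y[None, :] * 0.4 + rng.uniform(0, 0.6, size=(3, 500)), 0, 1)

avg_prob = probs.mean(axis=0)                        # soft voting: average probability

# Scan thresholds; keep the one maximizing accuracy subject to recall >= 65%
best = None
for t in np.arange(0.05, 0.95, 0.01):
    pred = (avg_prob >= t).astype(int)
    rec = recall_score(y, pred)
    if rec >= 0.65:
        acc = accuracy_score(y, pred)
        if best is None or acc > best[1]:
            best = (t, acc, rec)
print("threshold, accuracy, recall:", best)
```

The same (threshold, accuracy, recall) triples can be plotted against the threshold to visualize the trade-off before picking the final threshold.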

(8 points - 3 points for computing the average probability, 3 points for tuning the decision threshold probability, 2 points for reporting the accuracy and recall on test data)

E.3.4 Stacking ensemble - Logistic regression

Develop and tune a stacking ensemble based on the models 1-3 in E.3.2, with logistic regression as the metamodel. Tune the hyperparameter C and the decision threshold probability to maximize accuracy for a recall of at least 65%.

Report the accuracy and recall on test data for the ensemble.

(8 points - 3 points for tuning ‘C’, 3 points for tuning the decision threshold probability, 2 points for reporting accuracy and recall on test data)

E.3.5 Stacking ensemble - Random forest

Develop and tune a stacking ensemble based on the models 1-3 in E.3.2, with random forest as the metamodel. Tune the hyperparameter max_features and the decision threshold probability to maximize accuracy for a recall of at least 65%.

Report the accuracy and recall on test data for the ensemble.

(8 points - 3 points for tuning ‘max_features’, 3 points for tuning the decision threshold probability, 2 points for reporting accuracy and recall on test data)